22 research outputs found

Discovery and genotyping of structural variation from long-read haploid genome sequence data

Author: Boitano Matthew
Chaisson Mark J.P.
Chin Chen-Shin
Eichler Evan E
Gordon David
Graves-Lindsay Tina A
Hoekzema Kendra
Huddleston John
Korlach Jonas
Kronenberg Zev N
Munson Katherine M
Peluso Paul
Steinberg Karyn Meltz
Vives Laura
Warren Wes
Wilson Richard K
Publication venue: Digital Commons@Becker
Publication date: 01/01/2016

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

Author: Audano Peter A.
Baker Carl
Concepcion Gregory T.
Eichler Evan E.
Hunkapiller Michael W.
Kronenberg Zev N.
Lansdorp Peter M.
Logsdon Glennis A.
Munson Katherine M.
Peluso Paul
Porubsky David
Sanders Ashley D.
Spierings Diana C. J.
Sulovari Arvis
Surti Urvashi
Vollger Mitchell R.
Wenger Aaron M.
Publication venue: 'Wiley'
Publication date: 01/03/2020

The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.
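
The NG50 statistic quoted in this abstract can be illustrated with a minimal sketch. The function name `ng50` and the toy contig lengths are assumptions made for illustration only; they are not taken from the paper, and the authors' own tooling is not shown here.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50: the contig length L such that contigs of length >= L
    together span at least half of the (estimated) genome size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the assembly spans less than half of the genome size


# Hypothetical toy contigs (arbitrary units), not data from the paper.
toy_contigs = [500, 400, 300, 200, 100]
print(ng50(toy_contigs, genome_size=2000))

# Fold change between the two NG50 values reported in the abstract (kbp).
print(round(480.6 / 191.5, 1))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so two assemblies of the same genome can be compared directly.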

Proceedings - University of Groningen

Crossref

University of Groningen

ARTS repository - University of Groningen

MDC Repository

Dissertations of the University of Groningen

A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar.

Author: Brent S Pedersen
Eric T Dawson
Erik Garrison
Pjotr Prins
Zev N Kronenberg
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/05/2022

Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. The VCF format can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformations of variant files. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how pangenome graph formats can address more complex variation that cannot easily be represented by the VCF format.
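
To make the variant classes named in this abstract concrete, here is a minimal sketch in plain Python. It does not use vcflib, bio-vcf, cyvcf2, hts-nim or slivar, and the helper names `classify_variant` and `parse_vcf_line` are hypothetical. The length-based classification is a deliberate simplification and ignores special ALT values such as `*` or `.`.

```python
def classify_variant(ref, alt):
    """Roughly classify one REF/ALT pair by allele length (simplified)."""
    # Symbolic alleles (<DEL>, <DUP>, ...) and breakend notation mark SVs.
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf_line(line):
    """Split one tab-separated VCF data line into (chrom, pos, ref, alt, class)
    tuples, one per ALT allele. Fixed columns: CHROM POS ID REF ALT ..."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
    return [(chrom, pos, ref, alt, classify_variant(ref, alt))
            for alt in alts.split(",")]


# Hypothetical record with two ALT alleles.
record = "chr1\t100\t.\tA\tG,AT\t50\tPASS\t."
for variant in parse_vcf_line(record):
    print(variant)
```

For real pipelines, a maintained parser such as those listed in the abstract is the safer choice, since it handles headers, INFO/FORMAT fields and edge cases that this sketch skips.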

Directory of Open Access Journals

PubMed Central

Recommended from our members

Genomic Patterns of De Novo Mutation in Simplex Autism.

Author: Coe Bradley P
Darnell Robert B
Dickel Diane E
Eichler Evan E
Hoekzema Kendra
Hormozdiari Fereydoun
Kronenberg Zev N
Nelson Bradley J
Pennacchio Len A
Raja Archana
Turner Tychele N
Zody Michael C
Publication venue: eScholarship, University of California
Publication date: 01/10/2017

To further our understanding of the genetic etiology of autism, we generated and analyzed genome sequence data from 516 idiopathic autism families (2,064 individuals). This resource includes >59 million single-nucleotide variants (SNVs) and 9,212 private copy number variants (CNVs), of which 133,992 and 88 are de novo mutations (DNMs), respectively. We estimate a mutation rate of ∼1.5 × 10⁻⁸ SNVs per site per generation with a significantly higher mutation rate in repetitive DNA. Comparing probands and unaffected siblings, we observe several DNM trends. Probands carry more gene-disruptive CNVs and SNVs, resulting in severe missense mutations and mapping to predicted fetal brain promoters and embryonic stem cell enhancers. These differences become more pronounced for autism genes (p = 1.8 × 10⁻³, OR = 2.2). Patients are more likely to carry multiple coding and noncoding DNMs in different genes, which are enriched for expression in striatal neurons (p = 3 × 10⁻³), suggesting a path forward for genetically characterizing more complex cases of autism.

eScholarship - University of California

Wham: Identifying Structural Variants of Biological Consequence

Author: Brett J. Kennedy (762242)
Edward J. Osborne (832145)
Eric T. Domyan (832147)
Kelsey R. Cone (832146)
Mark Yandell (61663)
Michael D. Shapiro (209706)
Nels C. Elde (734902)
Zev N. Kronenberg (832144)
Publication date: 01/12/2015

Existing methods for identifying structural variants (SVs) from short read datasets are inaccurate. This complicates disease-gene identification and efforts to understand the consequences of genetic variation. In response, we have created Wham (Whole-genome Alignment Metrics) to provide a single, integrated framework for both structural variant calling and association testing, thereby bypassing many of the difficulties that currently frustrate attempts to employ SVs in association testing. Here we describe Wham, benchmark it against three other widely used SV identification tools (Lumpy, Delly and SoftSearch), and demonstrate Wham’s ability to identify and associate SVs with phenotypes using data from humans, domestic pigeons, and vaccinia virus. Wham and all associated software are covered under the MIT License and can be freely downloaded from GitHub (https://github.com/zeeev/wham), with documentation on a wiki (http://zeeev.github.io/wham/). For community support please post questions to https://www.biostars.org/.

Directory of Open Access Journals

PubMed Central

FigShare

Sensitivity and false discovery rates (FDR) for simulated data.

Author: Brett J. Kennedy (762242)
Edward J. Osborne (832145)
Eric T. Domyan (832147)
Kelsey R. Cone (832146)
Mark Yandell (61663)
Michael D. Shapiro (209706)
Nels C. Elde (734902)
Zev N. Kronenberg (832144)

The sensitivity and FDR of Delly, Lumpy, SoftSearch and Wham for simulated deletions, duplications, insertions and inversions. The sensitivity is measured for each category at depths of 10x and 50x. SVs ranging from 50 bp to 1 Mb are grouped into four left-closed size intervals. A) The sensitivity of the three tools is faceted on size, depth and SV type. At 10x Wham has noticeably better sensitivity for deletions and duplications in the smallest size class. Wham’s sensitivity is higher than Delly and Lumpy for insertions at 10x and gains sensitivity at 50x. B) The FDR for each type of SV faceted by depth and the amount of slop added to each confidence interval. In the 25 bp slop category, each confidence interval was extended in both directions by 25 bp. At 10x depth Wham has the highest FDR across all SV classes and Lumpy has the lowest. At 50x Delly has heightened FDR for deletions and Lumpy has a much higher FDR for insertions. Shrinking the confidence intervals increases the FDR for Delly and Lumpy, but not Wham. C) Breakpoint sensitivity for deletions. The confidence intervals provided by the three tools are ignored and slop is incrementally added to the predicted breakpoints. Wham has the highest sensitivity when 1–10 bp of slop is added. D) Genotype sensitivity for the homozygous non-reference simulated SVs. Delly and Wham have similar sensitivity for deletions and duplications while both tools fail to correctly genotype duplications.
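
The effect of slop on sensitivity and FDR described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the benchmark procedure used by the authors: matching by simple interval overlap, the function names `overlaps` and `sensitivity_and_fdr`, and the toy intervals are all hypothetical.

```python
def overlaps(call, truth, slop=0):
    """True if the call interval, extended by `slop` bp in both directions,
    overlaps the truth interval. Intervals are (start, end), inclusive."""
    c_start, c_end = call
    t_start, t_end = truth
    return c_start - slop <= t_end and t_start <= c_end + slop


def sensitivity_and_fdr(calls, truths, slop=0):
    """Sensitivity = fraction of truth SVs matched by at least one call.
    FDR = fraction of calls that match no truth SV."""
    found = sum(any(overlaps(c, t, slop) for c in calls) for t in truths)
    false_pos = sum(not any(overlaps(c, t, slop) for t in truths) for c in calls)
    sensitivity = found / len(truths) if truths else 0.0
    fdr = false_pos / len(calls) if calls else 0.0
    return sensitivity, fdr


# Hypothetical toy data: one call lands 10 bp past a true SV, one is spurious.
truths = [(100, 200), (1000, 1100)]
calls = [(210, 220), (5000, 5100)]
print(sensitivity_and_fdr(calls, truths, slop=0))
print(sensitivity_and_fdr(calls, truths, slop=25))
```

In this toy case, adding 25 bp of slop rescues the near-miss call, which raises sensitivity and lowers FDR at the same time. That is the direction of the effect the caption reports for Delly and Lumpy when confidence intervals are widened rather than shrunk.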

FigShare

Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C

Author: Concepcion Gregory T
Eichler Evan E
Fedrigo Olivier
Hall Richard J
Hiendleder Stefan
Jarvis Erich D
Kingan Sarah B
Koren Sergey
Kronenberg Zev N
Kuhn Kristen
Liachko Ivan
Low Wai Yee
Mueller Kathryn A
Munson Katherine M
Peluso Paul
Phillippy Adam M
Porubsky David
Rhie Arang
Smith Timothy P.L.
Sullivan Shawn T
Williams John L
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 10/08/2021

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially-phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without having parental data and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97% compared to 80–91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.

DigitalCommons@University of Nebraska

Wham detects structural variation in vaccinia virus populations.

Author: Brett J. Kennedy (762242)
Edward J. Osborne (832145)
Eric T. Domyan (832147)
Kelsey R. Cone (832146)
Mark Yandell (61663)
Michael D. Shapiro (209706)
Nels C. Elde (734902)
Zev N. Kronenberg (832144)

A) Read depth normalized within each sample is plotted across the ~200 kb vaccinia genome (excluding inverted terminal repeats) for either the parental strain (top panel) or an adapted strain (middle and bottom panels, called by Wham or Lumpy, respectively). Arrows highlight the positions of K3L CNV and E3L deletion. The black lines represent the breakpoints of every SV call after filtering (see Supporting Information, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004572#sec016). B) Wham calls in the adapted strain near the K3L duplication breakpoint are shown as black triangles above the viral genes in colored boxes. The height of the triangle represents split-read (SR) count supporting the call. Sanger sequencing positions relative to the reference sequence are listed below. Asterisks (*) indicate Wham calls that match the exact breakpoint determined by Sanger sequencing (see S3 Table, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004572#pcbi.1004572.s004, for Wham and Lumpy breakpoints). C) Wham calls in the adapted strain near the E3L deletion are shown above the genes, and Sanger sequence confirmed positions below, as in B. The arrow indicates the position of the 11K promoter driving β-gal expression. For breakpoints in grey, the height of the triangle indicates the relative mate-pair count from Wham, as these positions do not have SR support.

FigShare

Benchmarking Delly, Lumpy, SoftSearch and Wham against NA12878 and CHM1 datasets.

Author: Brett J. Kennedy (762242)
Edward J. Osborne (832145)
Eric T. Domyan (832147)
Kelsey R. Cone (832146)
Mark Yandell (61663)
Michael D. Shapiro (209706)
Nels C. Elde (734902)
Zev N. Kronenberg (832144)

A) The sensitivity and FDR for filtered NA12878 Phase III deletion calls across four size intervals. The number of true positives and the number of NA12878 calls are listed above sensitivity, while the total number of false positives and total calls for each tool are listed above FDR. Most true positives and false positives are within the 150–1,000 bp interval. B) The sensitivity and FDR for CHM1 deletions. C) The size distribution of the true positive calls that overlap the CHM1 deletions. One thousand true positives were randomly sampled from each tool and the truth set (CHM1-DEL).

FigShare